model: support Longcat-Flash (help wanted) #19182
ngxson wants to merge 5 commits into ggml-org:master from …
Conversation
Huh, interesting. Likely need to extend `ggml_mul_mat_id`.
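For reference, the existing interface looks roughly like this (abridged from `ggml.h`); every entry of `ids` has to index a real expert matrix in `as`, so there is currently no way to express a "do nothing" expert:

```cpp
// ggml.h (roughly): indirect matrix multiplication over a set of expert matrices.
//   as  -- 3D tensor holding all expert weight matrices, one per expert
//   b   -- input activations
//   ids -- per-token indices of the selected experts
// Every id selects a real matrix in `as`; there is no "identity"/skip slot.
GGML_API struct ggml_tensor * ggml_mul_mat_id(
        struct ggml_context * ctx,
        struct ggml_tensor  * as,
        struct ggml_tensor  * b,
        struct ggml_tensor  * ids);
```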
Hello, I'm also paying attention to the adaptation of the longcat-flash model. Suppose we bypass the requirement of …
@hebangwen Not sure I follow, but I think simply setting the coefficients like this should work:

```
# normal expert
alpha_i = 1.0f
beta_i  = 0.0f

# zero-compute expert
alpha_i = 0.0f
beta_i  = 1.0f
```

And the matrix C is just the MoE input (i.e. …)
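For concreteness, here is a minimal sketch (not from this PR) of how such per-slot coefficients could be applied in the graph, assuming hypothetical tensors `alpha` and `beta` of shape `[1, n_expert_used, n_tokens]` built from the routing result, and `experts` being the usual output of the expert matmuls:

```cpp
#include "ggml.h"

// Sketch only: per selected slot, out = alpha * FFN_expert(x) + beta * x.
// `experts` is the usual [n_embd, n_expert_used, n_tokens] output of the
// up/gate/down ggml_mul_mat_id chain; `alpha`/`beta` are hypothetical
// [1, n_expert_used, n_tokens] tensors built from the routing result:
//   normal expert slot       -> alpha = 1, beta = 0
//   zero-compute expert slot -> alpha = 0, beta = 1
// (router weights not shown here)
static ggml_tensor * apply_zero_expert_coeffs(
        ggml_context * ctx0,
        ggml_tensor  * experts, // [n_embd, n_expert_used, n_tokens]
        ggml_tensor  * x,       // [n_embd, n_tokens] MoE input
        ggml_tensor  * alpha,   // [1, n_expert_used, n_tokens]
        ggml_tensor  * beta) {  // [1, n_expert_used, n_tokens]
    const int64_t n_embd   = x->ne[0];
    const int64_t n_tokens = x->ne[1];

    experts = ggml_mul(ctx0, experts, alpha);          // weight the FFN path per slot

    // identity path: broadcast x over the expert-slot dimension
    ggml_tensor * x3d  = ggml_reshape_3d(ctx0, x, n_embd, 1, n_tokens);
    ggml_tensor * skip = ggml_mul(ctx0, ggml_repeat(ctx0, x3d, experts), beta);

    return ggml_add(ctx0, experts, skip);              // reduce over slots as usual afterwards
}
```

Note that this only fixes the math: the zero-compute slots still run through `ggml_mul_mat_id`, so none of the computation that the paper is trying to skip is actually saved.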
It can be simpler to explain the …
As @ggerganov suggested, I imagine the …

In the example above, computation for …

However, one issue is that the router weight …

However, yet another problem, even when the idea above is implemented: the output dim of …

For the calculation of the …
Seems like quite a bit more work than I initially thought, so I think we should reconsider whether this is worth implementing. Currently, only the longcat-flash family uses this technique, so it can be quite risky to add this much infrastructure to support it.
Yes, seems more complicated. Let's reconsider later in case this architecture shows any promise. |
I was working on #19167 but realized that the normal (non-ngram) model is not even supported yet.
Thinking it would be simple, I gave it a try, but ended up stuck implementing their notion of "zero-computing experts" (ref: link to paper).
The main problem is that `ggml_mul_mat_id` isn't made for this purpose, and I have no idea how to adapt it, or which ops may need to be added to make it work.

To illustrate the problem, I will take an example of how a normal MoE FFN works: the router picks `n_expert_used` experts per token, and the selected expert matmuls are applied with `ggml_mul_mat_id` (see the sketch at the end of this description). This means we spend the same amount of computation for each token, proportional to `n_expert_used`.

However, with longcat-flash: only tokens routed to a real expert (i.e. not one of the `n_zero_experts`) will go through the FFN; for the rest, they skip the FFN altogether.

Apart from the weird MoE, the model has a double-block architecture, meaning there are 2 attentions and 2 FFNs per layer. Upon converting to GGUF, we convert it to a model of `2 * n_layer` layers, which makes the implementation much easier.
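For readers less familiar with the existing code, below is a condensed sketch of the standard MoE FFN graph, loosely following llama.cpp's `build_moe_ffn`; tensor names such as `gate_inp`, `up_exps`, `gate_exps`, `down_exps` stand for the usual per-layer expert weights, and details like router-weight normalization are omitted:

```cpp
#include "ggml.h"

// Condensed sketch of the standard MoE FFN graph. Every token pays for exactly
// n_expert_used expert matmuls, regardless of which experts were selected.
static ggml_tensor * moe_ffn_sketch(
        ggml_context * ctx0,
        ggml_tensor  * cur,        // [n_embd, n_tokens] MoE input
        ggml_tensor  * gate_inp,   // [n_embd, n_expert] router weights
        ggml_tensor  * up_exps,    // [n_embd, n_ff, n_expert]
        ggml_tensor  * gate_exps,  // [n_embd, n_ff, n_expert]
        ggml_tensor  * down_exps,  // [n_ff, n_embd, n_expert]
        int n_expert, int n_expert_used) {
    const int64_t n_embd   = cur->ne[0];
    const int64_t n_tokens = cur->ne[1];

    ggml_tensor * logits  = ggml_mul_mat(ctx0, gate_inp, cur);          // [n_expert, n_tokens]
    ggml_tensor * probs   = ggml_soft_max(ctx0, logits);
    ggml_tensor * sel     = ggml_top_k(ctx0, probs, n_expert_used);     // [n_expert_used, n_tokens]
    ggml_tensor * weights = ggml_get_rows(ctx0,
            ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), sel);  // [1, n_expert_used, n_tokens]

    cur = ggml_reshape_3d(ctx0, cur, n_embd, 1, n_tokens);

    ggml_tensor * up   = ggml_mul_mat_id(ctx0, up_exps,   cur, sel);    // [n_ff, n_expert_used, n_tokens]
    ggml_tensor * gate = ggml_mul_mat_id(ctx0, gate_exps, cur, sel);
    gate = ggml_silu(ctx0, gate);

    ggml_tensor * experts = ggml_mul_mat_id(ctx0, down_exps,
            ggml_mul(ctx0, up, gate), sel);                             // [n_embd, n_expert_used, n_tokens]
    experts = ggml_mul(ctx0, experts, weights);

    // the n_expert_used slots are then summed into [n_embd, n_tokens]
    return experts;
}
```

With longcat-flash, some of the ids chosen by the router refer to zero-compute experts; those slots should cost nothing, but the graph above has no way to make a `ggml_mul_mat_id` slot free, which is the sticking point discussed in the conversation above.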